Getting Started


Statistics can strike fear into the hearts of students in many fields of
study. Yet understanding statistics is important for a variety of
subjects and areas of research, including mathematics, medicine,
pharmacology, biology, ecology, and many others. Any area of study that
involves compiling and manipulating data relies on statistics to
interpret that data.

Data are collections of raw numbers, such as the number of
persons in a population with a specific disease, age ranges in a population,
effects of a certain drug, the percentage of animals in an ecosystem, and other
things that can be counted or measured. The goal of many research projects is to take
“quantitative data” from an experiment and make sense of that data. In this
book, we will look at simple statistical problems as they apply to different
situations, boil them down to the mathematics involved, and learn how to
interpret data as a tool for understanding what the data really means.

The trick to data is that it can be manipulated in a variety of ways. Why 
mention the word “trick” in this explanation? Because the goal of researchers 
is to make a point based on the data they have collected. They put up graphs 
and make claims that are accurate to a certain number of percentage points or 
with a certain degree of confidence. This involves phrases like “standard 
deviation” and “95 percent confidence interval”, and so on. If you don’t 
know what these phrases mean, you really can’t know whether the data 
makes any sense at all or whether the researchers are trying to “trick” you 
into agreeing to their conclusion by manipulating the numbers in a way that 
favors the point they want to make. 

In short, your understanding of statistics from this book could make the
difference between being “a sucker” who buys into the “explanation”
of data given to you and being a knowledgeable scholar who knows how
data can be manipulated in any piece of research you read. Let’s start with the
basics of statistics (the kind you might remember from grade school) and
work our way up to more complex concepts so you can
be a true scholar in whatever field you are studying.



Chapter 1: 

Statistics 101: All the Stats
You Learned in Grade School


Statistics involves collecting data that can be expressed in number form.
It isn’t just saying “it was a nice day outside” but rather something like
comparing the number of minutes of the day that were rainy versus the
number of minutes of the day that were sunny. After collecting these data
points, researchers can summarize the
numbers in different ways to support their conclusion that “it
was a nice day”. Data points can be discrete numbers or a range of numbers,
which are handled differently, as you’ll come to understand.

The first things you need to know about the problem described above are the
basics of sample size, mean, median, and mode. If you measure the rainfall in
millimeters per minute for every minute of the day, you are doing a census,
in which every minute is counted. The same is true if you ask every person in
the country their age. In the
first problem, it wouldn’t be hard to do a census as there are only 1,440
minutes in a day. It is a lot harder to find the ages of everyone in the country,
so you would take a sample of the population and ask the people in the
sample their ages. This is important: ALWAYS pay attention to the
sample size and make sure the sample represents the population being sampled. If it’s
very small, the data may not apply to the larger number of people the sample
is meant to represent.

For example, you sample 6 people who take a drug and find that 2 of them got
hives from the drug. Can you then go on to say that 33.3 percent of people
get hives from taking the drug? Would it matter if the researchers had a
sample size of 100 people instead? What about 1000 people? Of course it
would matter! Again, look at the sample size.

Data can be discrete or continuous. In the example above, the data is a
count of the people who got hives, which can only be a whole number. In determining the rainfall
per minute, you might get 1 millimeter per minute, 1.3 millimeters per
minute, or 1.333 millimeters per minute. The number of people who got hives
represents discrete data, while measuring the rainfall is measuring
continuous data (values that can fall anywhere in a range, limited only by the precision of the measurement).



Now that you understand what you are collecting as it refers to data, let’s 
look at a set of data points so you know what certain key statistics terms 
mean. 

Problem 1. You sample 5 people to find their height to the nearest inch and 
have these numbers: 

42, 64, 65, 65, 67

What are the mean, median, and mode of this sample? The mean is the
average number. To find the mean, add the numbers and divide by the total
number of individuals being sampled. In this case, the mean is
(42 + 64 + 65 + 65 + 67)/5 = 60.6 inches (making this a continuous
data value). The median is the middle number when the values are arranged from
smallest to largest; in this case it is 65 inches. The median can be different
from the mean. The mode is the number that appears the most often among
all the numbers in the data set. In this set, the mode is also 65 inches.
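As a quick check, the same three values can be computed with Python's standard `statistics` module. This is only a sketch; the variable names are made up for illustration, and the heights are the five values used in the mean calculation above.

```python
import statistics

# Heights in inches, as used in the mean calculation for Problem 1.
heights = [42, 64, 65, 65, 67]

mean_height = statistics.mean(heights)      # sum of the values / number of values
median_height = statistics.median(heights)  # middle value after sorting
mode_height = statistics.mode(heights)      # value that appears most often

print(mean_height, median_height, mode_height)
```

Notice that the one very short person (42 inches) pulls the mean well below the median and mode.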



Problem 2. You are having a birthday party for your son Jack, who is 12. 
There are 9 children coming and you want to choose an age-appropriate 
activity for the party. Here are the data points, representing the ages of the 
participants: 

1, 1, 2, 2, 12, 12, 13, 13, 13

What’s the mean age of the participants and will this help you determine an 
age-appropriate activity? 

Answer: Mean = (1 + 1 + 2 + 2 + 12 + 12 + 13 + 13 + 13)/9 = 7.7 years. This
would indicate that maybe a bouncy house would be appropriate, but the
toddlers can’t participate and the older kids probably won’t want to
participate. Skip the bouncy house. What about the median? This is the
middle number when the ages are arranged in order, which is 12 years. Maybe consider
bowling? Again, this would work for half the kids or so but not for the other
half. The mode, or the most frequent number, is 13. Maybe bowling would
work after all because it would make the most children happy.

Problem 3. Let’s take the example of rainfall in a given day by taking a 
sampling of 9 different times of the day and recording the rainfall per minute 
to the nearest millimeter. Was it a nice day or not? 

0, 0, 0, 2, 4, 6, 10, 10, 10

Answer: Let’s look at the mean rainfall per minute in millimeters by adding
the numbers and dividing by 9: the mean is 4.67 mm per minute. Maybe it
wasn’t that good of a day because it was rainy. The median is 4 mm per minute;
again, it was raining so it might not have been a good day. The mode is
tricky, though. You have two values: 0 and 10. This is called a “bimodal
distribution”. You could argue that for most of the day it was rainy or that
most of the day was sunny. These kinds of distributions make it difficult to
interpret the data.
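A tie for the most frequent value is easy to miss by eye. The sketch below uses `statistics.multimode` (available in Python 3.8 and later) to list every mode of the nine rainfall readings; the variable names are illustrative only.

```python
import statistics

# Rainfall in millimeters per minute at nine sampled times (Problem 3).
rainfall = [0, 0, 0, 2, 4, 6, 10, 10, 10]

mean_rain = statistics.mean(rainfall)
median_rain = statistics.median(rainfall)
# multimode returns every value tied for most frequent, in first-seen order.
modes = statistics.multimode(rainfall)

print(round(mean_rain, 2), median_rain, modes)
```

A result with two modes is the bimodal case described above: neither mode alone summarizes the day.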



Problem 4. Abigail gets her grades back and wants to know if she is on the B
honor roll. She needs an average of 80 percent or more across her grades to make
the honor roll. Did she make it?

1. Math: 92 percent 

2. English: 81 percent 

3. Social Studies: 54 percent 

4. French: 65 percent 

5. Geography: 84 percent 

Answer: You want to know the mean (average) of her grades. The mean is
the value of all the numbers added together and divided by 5. The answer is
(92 + 81 + 54 + 65 + 84)/5 = 75.2 percent. This is below 80 percent, so
Abigail did not make the B honor roll.

Problem 5. You want to know the number of residents in a nursing
home who have colds each month and compare it to the national average. You know the
national average for colds each month is 18 percent of individuals over the
age of 65 years in a given year. Is the percentage of residents in your nursing
home who have colds a cause for alarm if your nursing home has exactly 100
residents?


Month    Cases
Jan      29
Feb      32
Mar      16
Apr      8
May      1
Jun      0
Jul      0
Aug      0
Sep      7
Oct      15
Nov      19
Dec      30


Answer: Start by getting the mean. Add the monthly cases and divide by 12. This adds
to a total of 157 cases of colds and an average of about 13 cases of the
common cold per month in a given year. With 100 residents, this equates to 13 percent. One could
argue that this is definitely not a cause for alarm based on the national
average. Later on, we’ll talk about whether or not a claim can be made that
the number of cases in your nursing home is significant. The median
is found by rearranging the numbers in order and finding the middle number.

0, 0, 0, 1, 7, 8, 15, 16, 19, 29, 30, 32

In this case 8 and 15 are the middle numbers. To find the median when there
is an even number of numbers in the data set, add the two middle numbers
and divide the sum by 2. Thus, the median is 11.5, still below the national
average of 18 percent.

Problem 6. A drug is given to a patient and the amount of the drug in the
system is considered to be 86 nanomoles per milliliter of blood at 0 minutes.
You measure the concentration of the drug over time. What is the mean
concentration of the drug? What is the half-life, or t1/2, of the drug?


Minutes    Conc. (nmol/mL)
0          86
30         73
60         62
90         54
120        43
150        32
180        21
210        11
240        5
270        0


Answer: Add the values of the concentration and divide by 10 as there are ten
points measured. This leads to a mean concentration of 38.7 nanomoles per
milliliter. The half-life is the time at which half of the drug is gone. This is when the
concentration reaches 86 divided by 2, or 43 nanomoles per milliliter, which happens at 120 minutes. Now, it gets more
complicated if the drug is given orally because the concentration at 0 minutes
would not be the highest value. The above example would work for an
intravenous drug only.


























Problem 7. Now we will estimate the mean from grouped data. You can
estimate the median this way as well. Let’s say that you have the heights of a
large sample of people, grouped into ranges of centimeters, and you want to know
the mean and median. Here’s your data:


Height Range (cm)    Number of Persons
150-154              5
155-159              2
160-164              6
165-169              8
170-174              9
175-179              11
180-184              6
185-189              3


So, now what? To make a good estimate, make a new table in which you
list the middle height of each group next to the number of people in that
group. Multiply the middle height of each group by
the number of people in that group and add up the products. Then divide the total by 50
because that’s how many people are in the sample you have. This leads you
to this summary of the data:


Height Range (cm)    Number of Persons    Middle Height    Height x Number
150-154              5                    152              760
155-159              2                    157              314
160-164              6                    162              972
165-169              8                    167              1336
170-174              9                    172              1548
175-179              11                   177              1947
180-184              6                    182              1092
185-189              3                    187              561
Total                50                   Total:           8530

























































Now, divide 8530 by 50 to get an estimated mean of 170.6 cm.

The median is somewhere between the 25th and 26th person out of
50 people (the mean of those two values), or somewhere in the range of
170-174 centimeters. The median can be calculated using this formula:

$$\text{Median} = L + \frac{\frac{n}{2} - B}{G} \times w$$

Where L is the lower class boundary of the group that contains the
median, n is the total number of people in the sample, B is the cumulative
frequency of the groups before the median group, G is the frequency of the
median group, and w is the group width.

In this case, these are the values:

L = 169.5 (just below the lower limit of the 170-174 median group)
n = 50
B = 5 + 2 + 6 + 8 = 21
G = 9
w = 5 (the number of centimeters in the subinterval 170-174)

Plugging these values into the formula, you get:

$$\text{Median} = 169.5 + \frac{\frac{50}{2} - 21}{9} \times 5$$

Calculating this, you get this: Estimated median = 171.7 centimeters

Okay, so that was a little bit more complicated but you get the idea. This is a
way to estimate the mean and median from grouped data.
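The grouped-data estimates can be sketched in a few lines of Python. The function and variable names below are hypothetical, chosen only for this illustration, and the lower class boundaries assume each group starts half a centimeter below its printed lower limit (149.5, 154.5, and so on), as in the worked example.

```python
def grouped_mean(midpoints, counts):
    """Estimate the mean as sum(midpoint x count) / total count."""
    total = sum(m * c for m, c in zip(midpoints, counts))
    return total / sum(counts)


def grouped_median(lower_bounds, counts, width):
    """Estimate the median with L + ((n/2 - B) / G) * w."""
    n = sum(counts)
    cumulative = 0  # B: how many values fall in the groups seen so far
    for lower, count in zip(lower_bounds, counts):
        if cumulative + count >= n / 2:  # this group contains the median
            return lower + (n / 2 - cumulative) / count * width
        cumulative += count
    raise ValueError("no data")


# Height data from Problem 7.
height_midpoints = [152, 157, 162, 167, 172, 177, 182, 187]
height_lower_bounds = [149.5, 154.5, 159.5, 164.5, 169.5, 174.5, 179.5, 184.5]
height_counts = [5, 2, 6, 8, 9, 11, 6, 3]

est_mean = grouped_mean(height_midpoints, height_counts)
est_median = grouped_median(height_lower_bounds, height_counts, width=5)
print(round(est_mean, 1), round(est_median, 1))
```

Both results are estimates: they assume the people in each group are spread evenly across that group's range.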



Problem 8. Age is a bit different when determining the mean in a range. For 
example, if the age range is 0-9 years, you can still be 9 years old up until 
your tenth birthday. This makes the middle of this range not 4.5 years but 5 
years of age. Let’s do one more example: 

You want an age distribution of 112 people in a small village and you get the 
values in this table when arranging the data: 


Age Range    Number    Middle Age    Multiplied
0-9          20        5             100
10-19        21        15            315
20-29        23        25            575
30-39        16        35            560
40-49        11        45            495
50-59        10        55            550
60-69        7         65            455
70-79        3         75            225
80-89        1         85            85
Total        112       Sum:          3360


The estimated mean is 3360/112 = 30 years.

So, what’s the estimated median? It’s the mean of the ages of the 56th and 57th
person, so it is somewhere between 20 and 29. Using the formula:

$$\text{Median} = L + \frac{\frac{n}{2} - B}{G} \times w$$

L = 20 (just at the lower limit of the median age range)
n = 112
B = 20 + 21 = 41
G = 23
w = 10 (the number of years in the age group)

$$\text{Median} = 20 + \frac{\frac{112}{2} - 41}{23} \times 10$$

The estimated median = 26.5 years

Let’s do the estimated mode as well, even though it isn’t as important a
number as the mean and median. In the above case, it is
probably in the 20-29 year category but might actually not be if the
distribution is bimodal or a fluke happened in the actual data. The formula
to calculate the mode in this type of distribution is:

$$\text{Estimated Mode} = L + \frac{f_m - f_{m-1}}{(f_m - f_{m-1}) + (f_m - f_{m+1})} \times w$$

where:

L is the lower class boundary of the modal group
f_{m-1} is the frequency of the group before the modal group
f_m is the frequency of the modal group
f_{m+1} is the frequency of the group after the modal group
w is the group width

Okay, so let’s plug the values for these frequencies into the formula. The
modal group (20-29) has 23 people, the group before it has 21, and the group
after it has 16:

$$\text{Estimated Mode} = 20 + \frac{23 - 21}{(23 - 21) + (23 - 16)} \times 10$$

The estimated mode is 22.2 years.

Anyway, it isn’t a perfect system but you still can get the idea.
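The mode estimate can be sketched the same way. The helper below is hypothetical (its name and arguments are chosen for this illustration only); it takes the frequencies straight from the age table in Problem 8, so the group after the modal 20-29 group is the 30-39 group with 16 people.

```python
def grouped_mode(lower_bounds, counts, width):
    """Estimate the mode from grouped data using the modal group and its neighbors."""
    m = max(range(len(counts)), key=lambda i: counts[i])  # index of the modal group
    f_m = counts[m]
    f_before = counts[m - 1] if m > 0 else 0
    f_after = counts[m + 1] if m + 1 < len(counts) else 0
    # Note: this divides by zero if the modal group and both neighbors are equal.
    return lower_bounds[m] + (f_m - f_before) / ((f_m - f_before) + (f_m - f_after)) * width


# Age data from Problem 8.
age_lower_bounds = [0, 10, 20, 30, 40, 50, 60, 70, 80]
age_counts = [20, 21, 23, 16, 11, 10, 7, 3, 1]

est_mode = grouped_mode(age_lower_bounds, age_counts, width=10)
print(round(est_mode, 1))
```

The estimate lands closer to the neighbor whose frequency is nearer to the modal group's, which is why it sits in the lower part of the 20-29 range here.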




Problem 9. You measure the length of ten items in cm and get this range. 
What is the mean length in centimeters, given this data and the fact that you 
have 10 items? 


Range    Number of Items
15-19    2
20-24    7
25-29    1
Total    10

Answer: Let’s expand the table:

Range    Number of Items    Middle Value    Multiplied
15-19    2                  17              34
20-24    7                  22              154
25-29    1                  27              27
Total    10                 Sum:            215

This leads to a mean of 215 divided by 10 or 21.5 centimeters.


Problem 10. You are doing an experiment on rabbits and need their 
estimated mean weight (pounds). This table shows the data you have on their 
weight. What is the average weight in pounds of your 10 rabbits? 


Range    Number of Rabbits    Middle Value    Multiplied
5-9      2                    7               14
10-14    5                    12              60
15-19    3                    17              51
Total    10                   Sum:            125


This leads to an average or mean weight of 125/10 or 12.5 pounds. 













































Problem 11. You are doing an experiment and have the ages of the people in 
your sample. What is the average and median age of the people in the sample 
in your study? 


Age Range    Number    Middle Age    Multiplied
0-9          15        5             75
10-19        18        15            270
20-29        20        25            500
30-39        15        35            525
40-49        12        45            540
50-59        9         55            495
60-69        5         65            325
70-79        4         75            300
80-89        2         85            170
Total        100       Sum:          3200


In this case, the mean is 3200/100 or 32 years of age.

So, let’s estimate the median. It is the number between the 50th and 51st
person, so it is between 20 and 29 years of age. Using the equation:

$$\text{Median} = L + \frac{\frac{n}{2} - B}{G} \times w$$

L = 20 (just at the lower limit of the median age range)
n = 100 (total number of participants)
B = 15 + 18 = 33 (number before the median group)
G = 20 (number in the median group of 20-29)
w = 10 (the number of years in the age group)

$$\text{Median} = 20 + \frac{50 - 33}{20} \times 10 = 28.5 \text{ years}$$

So, now you know how to manipulate data (discrete and ranges of data) to get
the mean, median, and mode. Let’s go on to do a few more important
statistical analyses.

What we know about data is that it isn’t always neat. Data can be points
along a line and the line might not always be perfect. Let’s do some scatter
plots and try to find an equation that best fits the plotted data. This is
important if you want to estimate the value at a place on the line that
doesn’t match exactly with a data point you collected.

Problem 12. You are giving a drug and you know the peak concentration 
varies with the amount of drug given. You want to know the linear 
relationship between the values. Here are the values: 


Dose (mg)    Concentration (mg/L)
14.2         215
16.4         325
11.9         185
15.2         332
18.5         406
22.1         522
19.4         412
25.1         614
23.4         544
18.1         421
22.6         445
17.2         408




Let’s graph the data points and draw a prediction line:

The goal is to get a line that best represents the linear relationship and has
about the same number of dots above and below the line. This line is called
the best-fit line or line of best fit. You can now use the line to estimate what the
concentration will be at a certain dose. For example, at a 20 mg dose, you’d get
a concentration of about 460 mg/L.

Just so you know, the line follows the equation y = mx + b, where y is the
concentration, m is the slope of the line, x is the dose in mg, and b is the
y-intercept. Some data sets give a line of best fit that has a y-intercept that does
not make sense because it is outside of possible data points. In such cases, the
y-intercept merely becomes a placeholder used to graph the line or to
determine the equation of the line. The slope and intercept can be calculated,
rather than eyeballed, for better accuracy. The process is called least squares regression. We’ll do
it but just for a few points as it gets really messy to use all the points:

1. For each (x, y) point, you will need to calculate x^2 and xy.

2. Then add all the x, y, x^2, and xy values separately. This means the sum of all x-values,
y-values, x^2-values, and xy-values.

3. Calculate the slope using this equation, where n is the number of points and Σ is the sum of the values:

$$m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$

4. Calculate the y-intercept:

$$b = \frac{\sum y - m\sum x}{n}$$

Let’s do three data points on the curve that seem to be close to the line:

• (14.2, 215)   x^2 = 201.64   xy = 3053

• (18.5, 406)   x^2 = 342.25   xy = 7511

• (25.1, 614)   x^2 = 630.01   xy = 15,411.4

Σx = 57.8
Σy = 1235
Σx^2 = 1173.9
Σxy = 25,975.4

Let’s calculate the slope:

$$m = \frac{3(25{,}975.4) - (57.8)(1235)}{3(1173.9) - (57.8)^2} = \frac{6543.2}{180.86} = 36.18$$

And the y-intercept:

$$b = \frac{1235 - 36.18(57.8)}{3} = -285$$

This leaves the line equation to be: y = 36.18x - 285

Now, this can’t be completely right because it would lead to a concentration
of -285 mg/L if you didn’t give any medicine, but you get the idea. This is
basically what researchers have to do to make estimates of the slope of a line
on a scatter graph.
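The least squares steps can be sketched as a short Python function. The function name is hypothetical, and the three points are the same ones used in the hand calculation; a real analysis would use all twelve points.

```python
def least_squares(points):
    """Return (slope, intercept) of the least squares line through (x, y) points."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    sum_xy = sum(x * y for x, y in points)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


# The three (dose, concentration) points used in the worked example.
points = [(14.2, 215), (18.5, 406), (25.1, 614)]
slope, intercept = least_squares(points)
print(round(slope, 2), round(intercept, 1))
```

Passing the full list of twelve points to the same function gives a line that reflects all of the data, not just the three points that look closest to the line.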

Problem 13. You gather the heights of some subjects in your
sample and you get this histogram. How many subjects were 60
inches or taller according to the histogram?

[Histogram: number of subjects by height in inches, axis from 50 to 80]

The answer, according to the histogram, is 15. Now what percentage of subjects
are 60 inches or taller? According to the histogram, this is 15/22 = 68.2
percent.












What is the average height of the subjects? You have to use the ranges of
50-54, 55-59, and so on to build a table like the tables we’ve used before.


Height    Subjects    Middle Ht.    Multiplied
50-54     3           52            156
55-59     4           57            228
60-64     6           62            372
65-69     5           67            335
70-74     3           72            216
75-79     1           77            77
Total:    22          Sum:          1384


The mean is 1384/22 = 62.9 inches 


Problem 14. You measure the lengths of several objects. You create this 
histogram of the data. What is the total number of objects and what 
percentage of objects are greater than or equal to 22 millimeters? 



The total number of objects is 100 and the percent of objects greater than or
equal to 22 millimeters = 74/100 or 74 percent.


















Problem 15. Let’s look at a problem regarding the weighted mean. What if
some values have more weight than others or are considered more important?
In this problem, you are taking a course and you need to know your grade.
The assignments count for 30 percent of the grade, the midterm counts for 20
percent, and the final exam counts for 50 percent of the grade. You need 70
percent to pass the course. Did you pass? Here are your scores:


Course         Score    Weight
Assignments    85       0.30
Midterm        72       0.20
Final          61       0.50
Total                   1.00


The weighted mean equals the weighted percent of each score added
together:

85(0.3) + 72(0.2) + 61(0.5) = 25.5 + 14.4 + 30.5 = 70.4 percent.
Whew!! You just passed!
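A weighted mean is just each score multiplied by its weight, then summed. Here is a small sketch with the scores and weights from this problem; the dictionary layout and names are only one illustrative way to organize them.

```python
# Scores and weights from Problem 15; the weights must add up to 1.
scores = {"Assignments": 85, "Midterm": 72, "Final": 61}
weights = {"Assignments": 0.30, "Midterm": 0.20, "Final": 0.50}

# Multiply each score by its weight and add the results.
weighted_mean = sum(scores[part] * weights[part] for part in scores)
passed = weighted_mean >= 70

print(round(weighted_mean, 1), passed)
```

For comparison, the plain (unweighted) mean of 85, 72, and 61 is about 72.7, so the heavy weight on the weak final exam pulls the grade down.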




Chapter 2: 

Ranges and Standard Deviations 


In this chapter, we’ll start with something easy and work up to standard 
deviations, which are something you hear about in research papers but can be 
confusing if you don’t remember the basics of what these things mean. 
Standard deviations and ranges give you the opportunity to look at a grouping 
of data points and know whether the data points are tightly clustered around a 
single point or are scattered all over the place and really don’t mean as much. 

Problem 16. You have a grouping of data points about rainfall averages in a 
given location. You want to know if it is consistently rainy every day at the 
location or whether there is a wide range of rainfall amounts in the area. Your 
data points are: 

15, 21, 57, 43, 11, 39, 56, 83, 77, 11, 64, 91, 18, 37 mm of rain per day 

What’s the range? This is easy; find the range by identifying the smallest 
number and the largest number in the group and subtract the smallest from 
the largest. The largest number is 91 and the smallest number is 11. 
Subtracting them gives a range of 80. 



Problem 17. Let’s try it with a graph. You decide to record the temperature
one afternoon and you collect a set of data points. The graph is shown. What
is the range in this situation?

This is actually a very easy problem. You take the largest number, which is
50 degrees, and subtract the smallest number, which is 10 degrees. This gives
a range of 40 degrees.




Problem 18. Let’s look at quartiles. Quartiles give you an idea of the
spread of the data. Sometimes researchers will say that the “upper quartile” of
individuals eating vegetables each day have a lower risk of heart disease
compared to the “lower quartile”. This allows the researchers to make a
broad-reaching statement about something when the exact data points cannot really
tell the story.

Quartiles involve breaking up the data into four quarters or sections with the
data points arranged in ascending order. Take this set of numbers:

1, 3, 3, 4, 5, 6, 6, 7, 8, 8

Find the median number. In this set, the median is halfway between 5 and 6,
or a median of 5.5. This is also considered the middle quartile or Q2. The
lower quartile marks off the first 25 percent of the numbers. It is the middle
of the lower half of the set (1, 3, 3, 4, 5), which leads to the first
quartile, or Q1, of 3. The upper quartile marks off the last 25 percent of the
numbers. It is the middle of the upper half of the set (6, 6, 7, 8, 8), so in
this case the third quartile, or Q3, is 7. Look at it this way:

Quartiles

Lower Quartile (Q1) = 3
Middle Quartile (Q2) = 5.5
Upper Quartile (Q3) = 7

Next, we will discuss the interquartile range, which is the range of the
numbers between the lower quartile and the upper quartile. In this
case, the interquartile range is 7 - 3 or 4.

Problem 19. Let’s learn how to draw a box and whisker plot or box plot as 
this makes a visual of the data set using the quartiles. First, look at a set of 
data points and determine the median, the lower quartile and the upper 
quartile: 

8, 11, 20, 10, 2, 17, 15, 5, 16, 15, 25, 6 


First arrange them in ascending order: 2, 5, 6, 8, 10, 11, 15, 15, 16, 17, 20, 25 

There are 12 numbers, so the median is halfway between the two middle numbers,
11 and 15, which gives 13. Splitting the first half of the
list into two equal parts gives a lower quartile of halfway between 6 and 8, or
7, while the upper quartile is halfway between 16 and 17 or 16.5.


Use these numbers to draw the box and whisker plot. Start with putting a 
number line to identify the smallest number, the largest number, the median, 
and the quartiles. The plot begins with a line from the smallest value to the 
lower quartile. Then continue the plot with a box from the lower quartile to 
the median and another box from the median to the upper quartile. Then, 
continue the plot with a line from the box to the value of the largest number. 
The box whisker plot for the data set looks like this: 


[Figure: Box and Whisker Plot on a number line from 0 to 30, labeling the lower value, the median, the upper quartile, and the upper value]


Think of this as a way to visually see how the data is dispersed or how the
data is skewed. The box shows that the median value sits to the right of the
center of the interquartile range. The interquartile range is 16.5 - 7 = 9.5.

Problem 20. What is the lower quartile, the median, and the upper quartile of 
this group of numbers? 

13, 18, 6, 20, 25, 11, 9, 18, 3, 30, 16, 9, 8, 23, 26, 17 

First arrange them: 3, 6, 8, 9, 9, 11, 13, 16, 17, 18, 18, 20, 23, 25, 26, 30 







The median is between the 8th and 9th numbers, so the median is 16.5. Then
cut the first and last halves of the data set in halves again. The lower quartile
is between the 4th and 5th numbers, so the lower quartile is 9. The upper
quartile is between the 12th and 13th numbers, so the upper quartile is 21.5. The
interquartile range is 21.5 - 9 or 12.5.
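The "split the list in half, then split each half again" method can be sketched as follows. The function name is made up for this illustration. Library routines such as `statistics.quantiles` use other interpolation conventions and may give slightly different quartiles for the same data.

```python
import statistics


def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method from this chapter."""
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]
    upper_half = data[-half:]  # leaves out the middle value when the count is odd
    return (statistics.median(lower_half),
            statistics.median(data),
            statistics.median(upper_half))


# Data from Problem 20.
numbers = [13, 18, 6, 20, 25, 11, 9, 18, 3, 30, 16, 9, 8, 23, 26, 17]
q1, q2, q3 = quartiles_by_halves(numbers)
iqr = q3 - q1  # interquartile range

print(q1, q2, q3, iqr)
```

The five numbers needed for a box and whisker plot are then `min(numbers)`, `q1`, `q2`, `q3`, and `max(numbers)`.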

Now, we’ll talk about percentiles and deciles. Percentiles are
similar to quartiles, but a percentile represents the percent of all values that
fall below a given value. You’ve heard about a child being in the 10th percentile for height, or the
85th percentile for height. Being in the xth percentile means that x percent of
the values are less than that value and 100 - x percent of the values are
greater than that value.

Deciles involve breaking up the values into ten equal parts. The first decile is
the 10th percentile or the first ten percent of the data, the second decile is the
20th percentile or the first 20 percent of the data, and so on.

Regarding quartiles, the lower quartile or “first quartile” is the 25th
percentile; the “second quartile” or the median quartile is the 50th percentile;
the upper quartile or the “third quartile” is the 75th percentile.



Problem 21. You are planning a survey on family size and collect a sampling 
of the family size of your patients. What is the first quartile, second quartile 
and third quartile based on the frequency? 


Family Size    Frequency
0              2
1              4
2              5
3              4
4              3
5              1
8              1


This is tricky. You have to write out these numbers completely, according to 
frequency. This leads to this: 

0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 8 
Now identify the middle number which is the second quartile: 2 
Identify the first quartile by splitting the first half in half: 1 
Locate the third quartile by splitting the second half in half: 3.5 
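Writing out a frequency table by hand is tedious for larger data sets. The sketch below expands the table into the full list and then applies the same split-in-halves idea; the names are illustrative, and the halves are taken by simple slicing, which works here because the count is even.

```python
import statistics

# Family size -> frequency, from the table in Problem 21.
frequency = {0: 2, 1: 4, 2: 5, 3: 4, 4: 3, 5: 1, 8: 1}

# Repeat each family size as many times as its frequency.
expanded = [size for size, count in sorted(frequency.items()) for _ in range(count)]

half = len(expanded) // 2
first_quartile = statistics.median(expanded[:half])   # middle of the first half
second_quartile = statistics.median(expanded)         # the median
third_quartile = statistics.median(expanded[half:])   # middle of the second half

print(expanded)
print(first_quartile, second_quartile, third_quartile)
```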











Problem 22. Let’s identify the quartiles from a graph, as the graph shows
population growth. When did the population reach about 10 percent of its
growth and when did it reach about 50 percent of its growth?

Answer: The peak population was 5400 in the year 2000. The lowest
population was 800. The difference is 4600. Ten percent of this growth
is 460. You must add the initial low of 800 to get 1260. This
happened in about 1921. At 50 percent growth, you get half of 4600 or 2300.
Add this to 800 and you get 3100, which happened in about 1960.

Problem 23. Now let’s look at mean deviation. The mean deviation is how 
far the different values are from the middle on average. It is the average of 
the absolute values of the distances of the data points in a set from the mean 
of the set. We’re getting closer to understanding things like “standard 
deviation”. This involves finding the mean, the distance of the data points 
from the mean, and the average of this distance. Let’s look at these data
points:

3, 6, 6, 7, 8, 11, 15, 16


First find the mean, which is the average of all the data points: 72/8 = 9. You
found the mean just as you did in chapter one by adding all the values and
dividing by the total number of data points. Now we’ll make a table of the
“distance from the mean”:


Value    Distance from Mean
3        6
6        3
6        3
7        2
8        1
11       2
15       6
16       7


Now, calculate the average of the distance from the mean by adding the
second column of numbers and obtaining its average: 30/8 = 3.75




Problem 24. What is the mean deviation of this set of numbers? 

75, 83, 96, 100, 121, 125 

First calculate the mean: 100 (add all the numbers and divide by 6). Now 
make a table of the distances from the mean: 


Value    Distance from Mean
75       25
83       17
96       4
100      0
121      21
125      25


The mean deviation is the average of the distance from the mean: 

92/6 = 15.33 
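The mean deviation takes only two passes over the data: one to get the mean and one to average the distances. A minimal sketch, with a hypothetical function name:

```python
def mean_deviation(values):
    """Average of the absolute distances of the values from their mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


md_problem_23 = mean_deviation([3, 6, 6, 7, 8, 11, 15, 16])
md_problem_24 = mean_deviation([75, 83, 96, 100, 121, 125])

print(md_problem_23, round(md_problem_24, 2))
```

The absolute value matters: without it, the distances above and below the mean would cancel out and always sum to zero.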

Now we’ll get into the topics of variance and standard deviation. The
standard deviation is a value that tells how the numbers in a group of data
points are spread out from the mean. A low standard deviation means that
most of the values are close to the mean. A high standard deviation
means that the values are more spread out.




Problem 25. First let’s look at the variance. This is the average of the squared differences
from the mean. Start by calculating the mean as you already know how to do.
Then subtract the mean from each number in the set and square the results.
Each result is called a squared difference. Then average the squared differences
to get the variance. To get the standard deviation, take the square root of the variance.

Let’s calculate the standard deviation for this set of numbers: 

75, 83, 96, 100, 121, 125 

This set has a mean of 100. Now let’s do a table of the deviations from the 
mean and the squared deviations: 


Value    Deviation from Mean    Squared Deviation
75       25                     625
83       17                     289
96       4                      16
100      0                      0
121      21                     441
125      25                     625


The total of the squared deviations is 1996. Divide by 6 to get the variance of
332.7. The square root of 332.7 is 18.2, which is the standard deviation.

























Problem 26. You are testing a class and students have these scores. What is
the standard deviation of the scores?

These are the scores: 

23%, 37%, 45%, 49%, 56%, 63%, 63%, 70%, 72%, 82% 

First, find the mean by adding the numbers and dividing by 10. The mean is 
56%. 

Now do a table of the deviations from the mean and the squared deviations: 


Value    Deviation from Mean    Squared Deviation
23       33                     1089
37       19                     361
45       11                     121
49       7                      49
56       0                      0
63       7                      49
63       7                      49
70       14                     196
72       16                     256
82       26                     676


The sum of the squared deviations is 2846, with an average of 284.6. The 
square root of this number is 16.87. 






































Problem 27. What is the standard deviation of this set of 12 numbers? 

271, 354, 296, 301, 333, 326, 285, 298, 327, 316, 287, 314 

Calculating the mean is the first step. Add the numbers and divide by 12:
the mean is 309.

Now, create a table of values, deviations from the mean, and the squared 
deviations. 


Value   Deviation from Mean   Squared Deviation
271     38                    1444
354     45                    2025
296     13                    169
301     8                     64
333     24                    576
326     17                    289
285     24                    576
298     11                    121
327     18                    324
316     7                     49
287     22                    484
314     5                     25


The sum of the squared deviations is 6146, and the average of these numbers 
is 6146/12 = 512.2. The standard deviation is the square root of 512.2 or 
22.63. 


Okay, so there is an easier way to do this. There is, of course, a formula to 
calculate the standard deviation of a data set. The first formula, below, 
calculates the standard deviation of a population and the second formula 
calculates the standard deviation of a “sample” of the population. 

Calculate the standard deviation of a population using this formula:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

where $N$ is the size of the population; $x_i$ is each individual number;
$\mu$ is the mean. So basically, this is exactly what we’ve been doing in
long form. There are calculators that can calculate the standard deviation if
you enter the numbers into the calculator.

If you have a population and then just do a sampling of the population instead 
of using the entire population, and calculate the standard deviation of the 
sample, the calculation is slightly different. 

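Calculate the standard deviation of a sample using this formula:

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

In this case, $n$ is the sample size; but instead of $\mu$ (the mean of the
population), $\bar{x}$ is the mean of the sample. Note that the sum is
divided by $n - 1$ rather than by $n$.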

So, what this means is that, while we will do things the “long” way, a
scientific calculator or an online standard deviation calculator will apply
these formulas for you.
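As one possible stand-in for such a calculator, Python's standard `statistics` module has a function for each case: `pstdev` for a population (divides by N) and `stdev` for a sample (divides by n - 1). The sketch below reuses the 12 numbers from Problem 27 purely as an illustration; the variable names are ours.

```python
import statistics

data = [271, 354, 296, 301, 333, 326, 285, 298, 327, 316, 287, 314]

# Population formula: divide the sum of squared deviations by N
population_sd = statistics.pstdev(data)

# Sample formula: divide the sum of squared deviations by n - 1
sample_sd = statistics.stdev(data)

print(round(population_sd, 2), round(sample_sd, 2))
```

The sample value always comes out a little larger than the population value for the same numbers, because it divides by a smaller number.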




Problem 28. What is the standard deviation of this set of numbers (which 
happen to be the first 10 numbers of the Fibonacci sequence)? 

0, 1, 1, 2, 3, 5, 8, 13, 21, 34

The mean is 8.8. Create a table of the values, deviations from the mean,
and the squared deviations.


Value   Deviation from Mean   Squared Deviation
0       8.8                   77.44
1       7.8                   60.84
1       7.8                   60.84
2       6.8                   46.24
3       5.8                   33.64
5       3.8                   14.44
8       0.8                   0.64
13      4.2                   17.64
21      12.2                  148.84
34      25.2                  635.04


The sum of the squared deviations is 1095.6, and the average of the squared
deviations is 109.56. The standard deviation is the square root of 109.56, or
10.467.





































Chapter 3: 

False Positives, False Negatives,
and the Confidence Interval


First, we will take a look at testing and the situations where there is a
false negative or a false positive. For this, you need to know that every
test produces true positives, true negatives, false positives, and false
negatives. When looking at information on tests, you need to know the percent
chance that a positive result reflects a “true positive” or that a negative
result reflects a “true negative”. Here’s how you do it:

Problem 29. Your patient thinks he has an allergy to house dust. The allergy
test for house dust is positive 90 percent of the time in people who have the
allergy (this is the true positive rate). The test will also say that the
person has the allergy 5 percent of the time in people who do not actually
have it. (This is the false positive rate.)

The incidence of allergies in the population is 1.3 percent. What is the 
percent chance that the person with a positive test for allergies really has the 
allergy to house dust? 

Create this table: 



                             The test is positive       The test is negative
The person is allergic       90 percent true positive   10 percent false negative
The person is not allergic   5 percent false positive   95 percent true negative


You can then diagram this out:

Percent who have the allergy:

Yes: 1.3 percent
    90 percent true positive = 1.3 x 0.9 = 1.17%
    10 percent false negative = 1.3 x 0.1 = 0.13%

No: 98.7 percent
    95 percent true negative = 98.7 x 0.95 = 93.8%
    5 percent false positive = 98.7 x 0.05 = 4.9%







Add the positive percentages: 4.9 percent + 1.17 percent = 6.07 percent of
people test positive. However, only 1.17 percent are truly positive, so
divide 1.17 by 6.07 (the total positive percentage). A person with a positive
test therefore has only a 19.3 percent chance of actually having the allergy.
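The same arithmetic can be sketched in Python. The function name `predictive_values` and its parameter names are chosen for this illustration only. It takes the share of people with the condition, the true positive rate, and the false positive rate, and returns the chance that a positive test is right and the chance that a negative test is right.

```python
def predictive_values(prevalence, true_positive_rate, false_positive_rate):
    """Return (chance a positive test is correct, chance a negative test is correct)."""
    true_pos = prevalence * true_positive_rate
    false_neg = prevalence * (1 - true_positive_rate)
    false_pos = (1 - prevalence) * false_positive_rate
    true_neg = (1 - prevalence) * (1 - false_positive_rate)
    positive_correct = true_pos / (true_pos + false_pos)
    negative_correct = true_neg / (true_neg + false_neg)
    return positive_correct, negative_correct

# House dust allergy: 1.3 percent have it, 90 percent true positive,
# 5 percent false positive
positive_correct, negative_correct = predictive_values(0.013, 0.90, 0.05)
print(f"{positive_correct:.1%}  {negative_correct:.1%}")
```

Because the code does not round 4.935 percent down to 4.9 percent before dividing, it gives about 19.2 percent for a positive test rather than the 19.3 percent found by hand.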


Problem 30. Use the above data to determine whether a negative test really
means that he doesn’t have the allergy:

0.13 percent + 93.8 percent = 93.93 percent total negatives, so that
93.8/93.93 = 99.9 percent. This makes the test a good screening test because
a negative result is correct 99.9 percent of the time. It isn’t a good test
for diagnosis, because a person with a positive test has only a 19.3 percent
chance of truly having the allergy (according to Problem 29).



Problem 31. You have a test for a virus that has the following characteristics: 


                            The test is positive       The test is negative
There is a real virus       97 percent true positive   3 percent false negative
There is not a real virus   2 percent false positive   98 percent true negative

If the incidence of the virus in the population is 1 percent, what is the chance 
that a positive test is truly accurately positive? 


With a 1 percent incidence, the four outcomes are:

Among the 1 percent with the virus:
    1 x 0.97 = 0.97 percent true positive rate
    1 x 0.03 = 0.03 percent false negative rate

Among the 99 percent without the virus:
    99 x 0.98 = 97.02 percent true negative rate
    99 x 0.02 = 1.98 percent false positive rate

Adding the positive tests leads to 0.97 + 1.98 percent, or 2.95 percent,
although only 0.97 percent are true positives, so 0.97/2.95 = 32.9 percent
accuracy of a positive test.

The reverse, for an accurate negative test, is 0.03 + 97.02 = 97.05 percent
total negatives, of which 97.02 percent are truly negative, leading to a
better than 99.9 percent accuracy for a negative test (97.02/97.05). Again,
the test is better for determining that there is no virus but isn’t as
accurate in saying that the virus is present.
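The calculation for a positive test in Problem 31 is an application of Bayes' theorem. Written as one formula, with "+" standing for a positive test (the notation here is ours, not the book's), it reads:

```latex
P(\text{virus} \mid +)
  = \frac{P(+ \mid \text{virus})\,P(\text{virus})}
         {P(+ \mid \text{virus})\,P(\text{virus}) + P(+ \mid \text{no virus})\,P(\text{no virus})}
  = \frac{(0.97)(0.01)}{(0.97)(0.01) + (0.02)(0.99)}
  = \frac{0.0097}{0.0295} \approx 0.329
```

The top of the fraction is the true positive share and the bottom is the total positive share, which is the same division as 0.97/2.95.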













Confidence Intervals 

Now we will talk about confidence intervals. A confidence interval is a range
of values, calculated from your data, in which you can say with a certain
percent level of confidence that the true mean lies. The idea behind this is
to have as narrow a range as possible in which you (as a researcher) can
state your claim. If you claim that a certain range contains the mean value
but your 95 percent confidence interval is quite wide, you don’t have as good
a leg to stand on as you would with a narrow 95 percent confidence interval.

Let’s look at some problems, but first you need to know what the z-score is.
The z-score measures the distance of a value from the mean, in standard
deviations. The z-score helps establish confidence intervals. Most people use
95 percent or 99 percent confidence intervals, which indicate that you are 95
percent or 99 percent confident that the interval contains the true mean,
given the spread of the data and the number of data points you took in order
to arrive at the mean. You need this table to make your calculations:


Confidence Interval   Z-Score
80%                   1.282
85%                   1.440
90%                   1.645
95%                   1.960
99%                   2.576
99.5%                 2.807
99.9%                 3.291
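The z-scores in the table do not have to be memorized. As a sketch, assuming Python 3.8 or later, the standard library's `statistics.NormalDist` can reproduce them. The helper name `z_for_confidence` is made up for this example.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """Two-sided z-score for a confidence level given as a fraction (0.95 = 95%)."""
    # Half of the leftover probability sits in each tail of the normal curve
    return NormalDist().inv_cdf(0.5 + level / 2)

for level in (0.80, 0.85, 0.90, 0.95, 0.99, 0.995, 0.999):
    print(f"{level:.1%}   {z_for_confidence(level):.3f}")
```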


The formula for the confidence interval, in the form “the mean +/- x”, is
this:

$$\text{Mean} \pm \frac{z\,s}{\sqrt{n}}$$

• $z$ is chosen from the table

• $s$ is the standard deviation

• $n$ is the number of observations

Everything after the ± sign is called the margin of error.
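Here is a minimal Python sketch of that formula. The function name `confidence_interval` and the numbers in the example (a mean of 50, a standard deviation of 10, and 25 observations) are made up for illustration and do not come from any problem in this book.

```python
import math

def confidence_interval(mean, std_dev, n, z):
    """Return (low, high) for mean +/- z * s / sqrt(n)."""
    margin_of_error = z * std_dev / math.sqrt(n)
    return mean - margin_of_error, mean + margin_of_error

# Hypothetical data: mean 50, standard deviation 10, 25 observations, 95 percent
low, high = confidence_interval(50, 10, 25, 1.960)
print(round(low, 2), round(high, 2))  # 46.08 53.92
```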

Problem 33. The size of the heart is measured in 30 randomly-selected 
children. The mean was determined to be 91 mm with a standard deviation of 
8 mm. What is the 85 percent confidence interval (assuming a normal, non- 
skewed pattern of distribution)? 


Use the formula: the margin of error is

$$\frac{z\,s}{\sqrt{n}} = \frac{(1.44)(8)}{\sqrt{30}} = 2.1$$

This leads to the 85 percent confidence interval around the mean of
91 +/- 2.1, or 88.9 to 93.1 mm.

Problem 34. You are weighing a group of 6th graders and find that 20 of
them have a mean weight of 105 pounds. The standard deviation is 15
pounds. What is the 99.5 percent confidence interval?

$$\text{Mean} = 105 \pm \frac{(2.807)(15)}{\sqrt{20}} = 105 \pm 9.41 \text{ pounds}$$

The 99.5 percent confidence interval is 95.59 to 114.41 pounds.

Problem 35. You are measuring the times of 8 runners who are sprinting the 
100-meter run and get a mean of 9.84 seconds and a standard deviation of 
0.08 seconds. What is the 99.9 percent confidence interval for this mean? 


$$\text{Mean} = 9.84 \pm \frac{(3.291)(0.08)}{\sqrt{8}} = 9.84 \pm 0.093$$

The 99.9 percent confidence interval is 9.75 to 9.93 seconds.

Problem 36. You are reading a novel that has 500 pages. You randomly
choose 25 pages and determine that there is a mean of 323 words to a page
with a standard deviation of 38.4 words. What is the mean number of words
with an 80 percent confidence interval (i.e. what is the range of the mean
with 80 percent confidence that you are right)?

Calculations:

$$323 \pm \frac{(1.282)(38.4)}{\sqrt{25}} = 323 \pm 9.85$$

The 80 percent confidence interval is 313.15 to 332.85.

Problem 37. You are a teacher and have ten students. The mean grade was 5 
points out of 10 and the standard deviation was 2.18. What is the 85 percent 
confidence interval of the mean to two decimal places? 

Answer:

$$\text{Mean} \pm \frac{z\,s}{\sqrt{n}} = 5 \pm \frac{(1.44)(2.18)}{\sqrt{10}} = 5 \pm 0.99$$

The 85 percent confidence interval of the mean is 4.01 to 5.99.



Problem 38. Here is a set of data. What is the mean, standard deviation, and 
95 percent confidence interval? (Okay, so this gets a little harder). 


33, 40, 44, 45, 49, 52, 54, 57, 58

Start by getting the mean: 48


Make a table of values, deviations from the mean and squared deviations: 


Value   Deviation from Mean   Squared Deviation
33      15                    225
40      8                     64
44      4                     16
45      3                     9
49      1                     1
52      4                     16
54      6                     36
57      9                     81
58      10                    100


The sum of the squared deviations is 548, and the average of the squared
deviations is 60.89. The standard deviation is the square root of 60.89, or
7.8.

Now do the calculations:

$$\text{Mean} \pm \frac{z\,s}{\sqrt{n}} = 48 \pm \frac{(1.96)(7.8)}{\sqrt{9}} = 48 \pm 5.1$$

The 95 percent confidence interval is 42.9 to 53.1.

Problem 39. This is a remake of a previous question but with a twist. You
now need to give the 95 percent confidence interval for the mean:

These are the first 10 numbers of the Fibonacci sequence. What is the 95
percent confidence interval for the mean?

0, 1, 1, 2, 3, 5, 8, 13, 21, 34

The mean of this data set is 8.8.

Now create a table of values, deviations from the mean, and squared 
deviations. 


Value   Deviation from Mean   Squared Deviation
0       8.8                   77.44
1       7.8                   60.84
1       7.8                   60.84
2       6.8                   46.24
3       5.8                   33.64
5       3.8                   14.44
8       0.8                   0.64
13      4.2                   17.64
21      12.2                  148.84
34      25.2                  635.04


The sum of the squared deviations is 1095.6. The average of the squared
deviations is 109.56. The standard deviation is the square root of 109.56, or
10.467.

$$\text{Mean} = 8.8 \pm \frac{(1.96)(10.467)}{\sqrt{10}} = 8.8 \pm 6.487$$

The 95 percent confidence interval is 2.313 to 15.287 (which is a very wide
interval considering the values of the data points).






























Problem 40. We will again revive another question and will look at the 95
percent confidence interval of the following set of data:

271, 354, 296, 301, 333, 326, 285, 298, 327, 316, 287, 314

Find the mean: Add the numbers and divide by 12; the mean is 309.

Now create a table of values, deviations from the mean, and squared 
deviations. 


Value   Deviation from Mean   Squared Deviation
271     38                    1444
354     45                    2025
296     13                    169
301     8                     64
333     24                    576
326     17                    289
285     24                    576
298     11                    121
327     18                    324
316     7                     49
287     22                    484
314     5                     25


The sum of the squared deviations is 6146, with an average of 6146/12 = 
512.2. The standard deviation is the square root of 512.2 or 22.6. 


$$\text{Mean} \pm \frac{z\,s}{\sqrt{n}} = 309 \pm \frac{(1.96)(22.6)}{\sqrt{12}} = 309 \pm 12.8$$

The 95 percent confidence interval is 296.2 to 321.8.

Now that you have means, standard deviations, and 95 percent confidence
intervals, let’s look at something even more complex.
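The whole chain for Problem 40, from raw numbers to the interval, can be sketched in a few lines of Python. This follows the book's approach (population standard deviation and the z-score of 1.960 from the table); the variable names are ours.

```python
import math
import statistics

data = [271, 354, 296, 301, 333, 326, 285, 298, 327, 316, 287, 314]

mean = statistics.fmean(data)             # 309.0
sd = statistics.pstdev(data)              # square root of 6146/12
margin = 1.960 * sd / math.sqrt(len(data))

low, high = mean - margin, mean + margin
print(round(low, 1), round(high, 1))      # 296.2 321.8
```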












































Chapter 4: 

Chi Square Analyses or 
Chi Square Tests 


Consider the case where you have two different groups and have gotten the 
values for each group. Is the difference between the two groups significant or 
is it just a random difference that doesn’t mean anything? The Chi Square 
analysis can tell you if your data really means anything. 

What the Chi Square test gives you is the “p-value”. The p-value is used to
test the validity of a research claim that is made about a given population.
It is a number between 0 and 1. The lower the p-value, the stronger the
evidence that the difference between the two populations is real rather than
a chance result.

The claim being made is referred to as the alternative hypothesis, and the
p-value tests this hypothesis. The alternative hypothesis says that there is
a true difference between the two groups. The opposite claim is the “null
hypothesis”, which is that the claim is untrue and the values for the two
groups are statistically equivalent. If the null hypothesis is true, there
would be no difference between the groups.

As a researcher, you want your p-value to be as low as possible. These 
statements are true about the p-value: 

• A small p-value (generally less than or equal to 0.05) means that
there is strong evidence against the null hypothesis, so you can
claim the alternative hypothesis.

• A large p-value (generally greater than 0.05) means that there is
weak evidence against the null hypothesis, so you cannot claim the
alternative hypothesis.

• Those p-values very close to 0.05 are considered to go either
way.

• Always report the p-value in your research so that the readers are 
able to make their own conclusions about your claim. 

Why is the cutoff for the p-value set to 0.05? This is basically arbitrary.
You could use 0.01 to be even more sure that data set A is different from
data set B.

Let’s do an example: 



Problem 41. You do a survey of men and women and determine that among
women, 231 prefer cats and 242 prefer dogs. Among men, 207 prefer cats
and 282 prefer dogs. Is this a statistically significant difference or not?

Let’s figure it out: 


Gender      Cats   Dogs   Summation
Men         207    282    489
Women       231    242    473
Summation   438    524    Total: 962


Now you will calculate the “expected” total for each category. You do this by
multiplying the row total by the column total and dividing by the total
number that has been collected (962). It’s best to do this in the table:


Gender      Cats           Dogs           Summation
Men         489(438)/962   524(489)/962   489
Women       438(473)/962   524(473)/962   473
Summation   438            524            Total: 962




































Gender      Cats     Dogs     Summation
Men         222.64   266.36   489
Women       215.36   257.64   473
Summation   438      524      Total: 962


Next, take the actual number minus the “expected number”, square it and 
divide by the expected number. The results are in this table: 


Gender   Cats    Dogs
Men      1.099   0.918
Women    1.136   0.949


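These need to be added to get the Chi Square: 1.099 + 0.918 + 1.136 + 0.949
= 4.102

The formula for Chi Square is the sum, over every cell in the table, of the
squared difference divided by the expected value:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$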

Look up the p-value in a table; the value depends on the “degrees of
freedom”. The degrees of freedom are (number of columns minus one) x (number
of rows minus one) = 1. There is also a p-value calculator that can be found
online, https://www.socscistatistics.com/pvalues/chidistribution.aspx

The p-value according to the table is .042833. The result is significant at
p < 0.05. It would not be significant at p < 0.01. The link above gives the
same result and is very helpful.
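The steps in Problem 41 can be sketched in Python using only the standard library. The function name `chi_square_table` is made up for this illustration. The p-value line uses `math.erfc`, a shortcut that is only valid when there is exactly one degree of freedom; for other cases a table or calculator is still needed.

```python
import math

def chi_square_table(observed):
    """Return (chi square, degrees of freedom) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count = row total x column total / grand total
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (obs - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df

# Rows: men, women. Columns: cats, dogs.
chi2, df = chi_square_table([[207, 282], [231, 242]])

# Shortcut for the p-value that holds only when df == 1
p_value = math.erfc(math.sqrt(chi2 / 2))
print(round(chi2, 2), df, round(p_value, 3))
```

The code keeps every decimal place instead of rounding each cell, so its Chi Square can differ from the hand calculation in the third decimal place.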
















Problem 42. You are surveying a group of individuals about whether they like
pop music or rock music and you record the results in this table:


Gender   Pop Music   Rock Music
Boys     176         63
Girls    215         46


What is the Chi Square and the p-value if you believe that there is a 
difference between the two? 

First, set up a table that lists the values in the situation and the summation of 
the numbers: 


Gender      Pop Music   Rock Music   Summation
Boys        176         63           239
Girls       215         46           261
Summation   391         109          500


Next, set up the expected numbers that would exist if, in fact, there is no
difference between the preferences of males and females:

Gender      Pop Music      Rock Music     Summation
Boys        239(391)/500   239(109)/500   239
Girls       261(391)/500   261(109)/500   261
Summation   391            109            500


These are the values that would exist if there were no differences between 
their preferences: 


Gender      Pop Music   Rock Music   Summation
Boys        186.9       52.1         239
Girls       204.1       56.9         261
Summation   391         109          500


Chi square analysis: Actual number minus the expected number, squared and
divided by the expected number:

Gender   Pop Music   Rock Music
Boys     0.636       2.28
Girls    0.582       2.09

Add these together to get the Chi square: 5.588

Use the table with 1 degree of freedom to get this: The p-value is 0.018084.
The result is significant at p < 0.05. What this means is that there is a
statistically significant difference between the musical preferences of boys
and girls.


Problem 43. You have taken a survey of individual likes and dislikes 
regarding food preferences. The survey included three separate types of food, 
which makes it a little more complicated because there is more than one 
degree of freedom. Find the Chi square and the p-value for these data points: 


Age          Chicken   Burgers   Chinese
20 or less   106       119       25
Over 20      117       141       92


The question at hand is whether these are statistically different from one 
another. Let’s start with a table of values with summations: 


Age          Chicken   Burgers   Chinese   Summation
20 or less   106       119       25        250
Over 20      117       141       92        350
Summation    223       260       117       600


What would be expected if there was no difference between the two groups? 


Age          Chicken   Burgers   Chinese   Summation
20 or less   92.9      108.3     48.75     250
Over 20      130.1     151.7     68.25     350
Summation    223       260       117       600


Calculate the Chi square using this formula:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$


Age          Chicken   Burgers   Chinese
20 or less   1.85      1.06      11.57
Over 20      1.32      0.75      8.26


Add these numbers to get the Chi square: 24.81 


































































Now find the p-value. The number of columns minus 1 is 2 and the number of
rows minus 1 is 1, so there are 2 x 1 = 2 degrees of freedom. From the table,
the p-value is < 0.00001. The result is significant at p < 0.05. This means
that the food preferences of the two age groups differ by a statistically
significant amount.

Problem 44. In an experiment, you flip a coin 50 times and get 20 heads and 
30 tails. What is the Chi square and p-value of this experiment? Start with a 
table: 



        Observed   Expected
Heads   20         25
Tails   30         25


Chi square: This is the sum of (Observed - Expected) squared, divided by
Expected:

$$\chi^2 = \frac{(20 - 25)^2}{25} + \frac{(30 - 25)^2}{25} = 1 + 1 = 2$$

P-value (with one degree of freedom, because there are two categories): The
p-value is 0.157299. The result is not significant at p < .05.




Problem 45. A die is tossed 60 times with the following results: 

7 ones, 11 twos, 10 threes, 12 fours, 9 fives and 11 sixes 

Is this a loaded die or are the results of the tosses within the expected
range?

Remember that the Chi square = the summation of the

$$\frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

Roll   Observed   Expected   Chi Square
1      7          10         0.9
2      11         10         0.1
3      10         10         0
4      12         10         0.4
5      9          10         0.1
6      11         10         0.1


Summation of Chi square of each roll is the summation of the last column: 

1.6 


Calculate the p-value with 5 degrees of freedom: The p-value is 0.901249. 
The result is not significant at p < 0.05. This means that it is probably not a 
loaded die. 
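Problems 44 and 45 compare one list of observed counts with one list of expected counts. A Python sketch of that comparison is below. The names `chi2_sf` and `goodness_of_fit` are made up for this example. The p-value helper uses a standard step-up rule for whole-number degrees of freedom (starting from the one- or two-degree case and adding one term per extra pair of degrees), so it stands in for the printed table or the online calculator.

```python
import math

def chi2_sf(chi2, df):
    """Upper-tail p-value of the Chi square distribution for whole-number df >= 1."""
    half = chi2 / 2
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)              # exact value for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))   # exact value for df = 1
    # Each pass raises the degrees of freedom by 2
    while a < df / 2:
        q += half ** a * math.exp(-half) / math.gamma(a + 1)
        a += 1
    return q

def goodness_of_fit(observed, expected):
    """Return (chi square, degrees of freedom, p-value)."""
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return chi2, df, chi2_sf(chi2, df)

# Problem 45: a die tossed 60 times, 10 expected per face
die_chi2, die_df, die_p = goodness_of_fit([7, 11, 10, 12, 9, 11], [10] * 6)

# Problem 44: 20 heads and 30 tails, 25 expected of each
coin_chi2, coin_df, coin_p = goodness_of_fit([20, 30], [25, 25])

print(round(die_chi2, 2), die_df, round(die_p, 4))
print(round(coin_chi2, 2), coin_df, round(coin_p, 4))
```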


Problem 46. You do a survey and identify the number of people who pass 
their drivers’ test on their first attempt or their second attempt. You want to 
know if the result is significant and if there is a gender difference among 
those passing their drivers’ test. The table shows the data from the survey: 


Gender      1st Attempt   2nd Attempt   Summation
Men         126           211           337
Women       135           178           313
Summation   261           389           650


Expected Results: 


Gender      First Time   2nd Time   Summation
Men         135.3        201.7      337
Women       125.7        187.3      313
Summation   261          389        650





(Actual - Expected) squared, divided by Expected, to get the Chi square
values:

Gender   First Time   2nd Time
Men      0.64         0.43
Women    0.69         0.46


Add these up to get a Chi square of 2.22. 

The p-value with one degree of freedom is 0.136233. The result is not
significant at p < 0.05. This means that the survey does not show a
statistically significant gender difference in passing a drivers’ test on
the first attempt.



Problem 47. In a survey, children were asked to give their favorite color. The 
results of the survey are in the table. Find out if the results are significant and 
if there is a difference between boys and girls and their favorite color: 


Gender   Blue   Green   Yellow   Sum
Boys     63     126     7        196
Girls    85     91      28       204
Sum      148    217     35       400


So, what would be expected if there was no difference in their opinions? 


Gender   Blue   Green   Yellow   Sum
Boys     72.5   106.3   17.2     196
Girls    75.5   110.7   17.8     204
Sum      148    217     35       400


(Actual - Expected) squared, divided by Expected, to get the Chi square
values:

Gender   Blue   Green   Yellow
Boys     1.24   3.65    6.05
Girls    1.20   3.51    5.84


Add the Chi square values: 21.49 

There are two degrees of freedom: The p-value is 0.000022. The result is 
significant at p < 0.05. 
















Problem 48. You take a survey to see how many people by age can and 
cannot swim. The question is whether age plays a role in whether or not a 
person can swim. Is the information you have gathered significant? The data 
from the survey is in this table. 


Age           Can Swim   Can’t Swim   Sum
20 or less    39         17           56
21-40         47         54           101
41-60         27         18           45
61 or older   27         21           48
Sum           140        110          250


Expected Values: 


Age           Can Swim   Can’t Swim   Sum
20 or less    31.4       24.6         56
21-40         56.6       44.4         101
41-60         25.2       19.8         45
61 or older   26.9       21.1         48
Sum           140        110          250


(Actual - Expected) squared, divided by Expected, to get the Chi square
values:

Age           Can Swim   Can’t Swim
20 or less    1.84       2.35
21-40         1.63       2.08
41-60         0.13       0.16
61 or older   0.00       0.00


Sum of these: 8.19 

P-value with 3 degrees of freedom: The p-value is 0.042244. The result is 
significant at p < .05. This basically means that age does affect the ability to 
swim. 








































































Problem 49. You are assessing whether or not there is a statistically
significant relationship between age and the wearing of glasses. You have
taken a survey and the data points are in this table:


Age           Glasses   No glasses   Sum
20 or under   15        56           71
21-40         19        37           56
41-60         34        35           69
61 or older   51        23           74
Sum           119       151          270

Expected value if no differences:

Age           Glasses   No glasses   Sum
20 or under   31.3      39.7         71
21-40         24.7      31.3         56
41-60         30.4      38.6         69
61 or older   32.6      41.4         74
Sum           119       151          270

Chi square Analysis:

Age           Glasses   No glasses
20 or under   8.49      6.69
21-40         1.32      1.04
41-60         0.43      0.34
61 or older   10.39     8.18


Sum of the Chi squares = 36.88 

In this case, there are 3 degrees of freedom: (4 - 1) x (2 - 1) = 3

From the table: The p-value is < 0.00001. The result is significant at
p < 0.05. This means that there is a statistically significant relationship
between a person’s age group and the wearing of glasses.










Problem 50. Let’s go back to one of our original word problems. You have a 
nursing home with 100 residents and you measure the number of colds each 
month. The national average is 18 percent per month or 18 cases per month 
per 100 people during the year. Is your claim that your mean is only 13 
percent per month significant? 


Month      Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
Cases      29    32    16    8     1     0     0     0     7     15    19    30
Expected   18    18    18    18    18    18    18    18    18    18    18    18
Chi Sq.    6.7   10.9  0.22  5.6   16.1  18    18    18    6.7   0.5   0.06  8


Add the Chi squares up: 108.78 

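P-value with 11 degrees of freedom: The p-value is < 0.00001. The result is
significant at p < 0.05. This means that the monthly case counts in your
nursing home differ significantly from the national average of 18 per month,
which supports your claim that your nursing home “cold” average is
statistically different from (here, lower than) the national average.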

We hope you enjoyed this guide! If so, can you leave a review on the 
Amazon book page? It would be greatly appreciated! It will help more 
students to see the value they can get from this guide. 

If you have any suggestions on ways to improve this book, please contact us 
at: support@mathwizo.com 
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